Iris Grabov & Shay Gelbart
About the task:
This project focuses on predicting house prices using a linear regression model built on a comprehensive housing dataset. The process involved exploring the data to understand its structure, cleaning and preprocessing to address missing values and outliers, and analyzing relationships between features and the target variable. By leveraging advanced techniques in feature engineering and model evaluation, we aimed to build an accurate and interpretable predictive model.
This code sets up the environment for data analysis and machine learning using libraries like numpy, matplotlib, sklearn, and pandas. It customizes plot settings for readability and defines a threshold (min_correlation = 0.2) to filter out features with weak correlations (below 0.2) with the target variable.
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import sklearn
from sklearn import datasets
from sklearn import pipeline, preprocessing
from sklearn import metrics
from sklearn import linear_model
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import seaborn as sns # we will use it for showing the regression line
# define plt settings
plt.rcParams["font.size"] = 20
plt.rcParams["axes.labelsize"] = 20
plt.rcParams["xtick.labelsize"] = 20
plt.rcParams["ytick.labelsize"] = 20
plt.rcParams["legend.fontsize"] = 20
plt.rcParams["figure.figsize"] = (20,10)
min_correlation = 0.2
The code loads the training and test data from CSV files, resets their indices to a default integer range (0 to N), and displays the DataFrames.
train_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
train_df.reset_index(drop=True, inplace=True) # reset the index to the default integer range 0..N: drop discards the old index and inplace modifies the df rather than creating a new one
print("the train data:")
display(train_df)
test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
test_df.reset_index(drop=True, inplace=True) # reset the index to the default integer range 0..N: drop discards the old index and inplace modifies the df rather than creating a new one
print("the test data:")
display(test_df)
We have one table for training and one for testing; the train table has 81 columns:
In the initial phase of our work, we will explore and analyze the data. We will examine the features to understand their impact on the target
print("the columns in train:")
train_df.columns
print("the columns in test:")
test_df.columns
print("the train describe:")
train_df.describe()
Explanation of the features in our data: we have features of different types, namely numerical, categorical, and ordinal features. We want to understand more about our data.
categorical = train_df.dtypes[train_df.dtypes == "object"].index
print("Number of Categorical features: ", len(categorical))
numerical = train_df.dtypes[train_df.dtypes != "object"].index
print("Number of Numerical features: ", len(numerical))
print("numerical:")
print(train_df[numerical].columns)
print("\ncategorical:")
print(train_df[categorical].columns)
Empty values can appear as '' in string columns, or as NaN values.
When we get a new dataset, we typically fill the empty values with the feature's median or mean, or remove the affected rows/columns if we have enough data.
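The mean/median strategies above can be sketched on a small hypothetical frame (toy data, not the competition dataset):

```python
import pandas as pd

# Toy example (not the competition data): one numeric column with a missing value
df = pd.DataFrame({"LotFrontage": [60.0, 80.0, None, 70.0]})

# Fill the NaN with the column's median; .mean() would work the same way
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

print(df["LotFrontage"].tolist())  # [60.0, 80.0, 70.0, 70.0]
```

Dropping instead of filling would be `df.dropna()`, which is only advisable when enough rows remain afterwards.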
# check for Nan values in the dataset
print("Nan values in the train:")
train_df.isna().any() # check if there are Nan values
In this dataset there are missing values; for example, in LotFrontage.
Let's check the types of the columns.
# check columns type, if the dataset has mix types. type is an object
train_df.dtypes
This output shows the data type of each column in the DataFrame.
int64: columns containing integer values. For example, Id, MSSubClass, LotArea, MoSold, YrSold, and SalePrice are integer columns.
float64: columns containing floating-point (decimal) numbers, such as LotFrontage.
object: columns containing categorical data, usually strings. For example, MSZoning, SaleType, and SaleCondition are categorical columns, which can store values like labels or categories.
Length: 81 indicates the DataFrame has 81 columns in total.
# display the dataset info, count, Nan, columns type, etc.
print("train info:")
train_df.info()
print("\ntest info:")
test_df.info()
We want to better understand what is missing and how much of it there is, so we can handle the data we have at hand.
# Calculate the total number of missing values in each column and sort in descending order
missing_total = train_df.isnull().sum().sort_values(ascending=False)
# Calculate the percentage of missing values for each column and sort in descending order
missing_percentage = (train_df.isnull().sum() / len(train_df)).sort_values(ascending=False)
# Combine the total and percentage of missing values into a single DataFrame
missing_info = pd.concat([missing_total, missing_percentage], axis=1, keys=['Total Missing', 'Percentage'])
# Display the top 20 columns with the most missing data
missing_info.head(20)
In the numerical types we want to fill the data with the mean. The numerical types that have missing data are:
LotFrontage - frontage of the lot in feet. MasVnrArea - area of masonry veneer in square feet. GarageYrBlt - the year the garage was built.
# Fill missing values with the mean for specific columns
columns_to_fill = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']
# Apply mean to the specified columns for training and testing data
train_df[columns_to_fill] = train_df[columns_to_fill].fillna(train_df[columns_to_fill].mean())
test_df[columns_to_fill] = test_df[columns_to_fill].fillna(test_df[columns_to_fill].mean())
After imputing the missing numerical features with their mean values, we need to handle the remaining missing data, and there are several ways to do so. Taking the PoolQC feature as an example, it would not be correct to remove all the rows that have no pool, because that would discard most of the data.
# List of columns where missing values have a specific meaning (e.g., "None")
cols_with_meaningful_nan = [
'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu',
'GarageQual', 'GarageCond', 'GarageFinish', 'GarageType', 'Electrical',
'KitchenQual', 'SaleType', 'Functional', 'Exterior2nd', 'Exterior1st',
'BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType1', 'BsmtFinType2',
'MSZoning', 'Utilities'
]
for column in cols_with_meaningful_nan:
    most_frequent_value = train_df[column].mode()[0]
    train_df[column] = train_df[column].fillna(most_frequent_value)
    test_df[column] = test_df[column].fillna(most_frequent_value)
After we do all that, let's check whether we still have missing data.
train_df.isnull().sum().sum()
test_df.isnull().sum().sum()
print(test_df.isnull().sum()[test_df.isnull().sum() > 0])
A few numeric columns in the test set still contain missing values; we fill them with the mean as well:
# Fill missing values with the mean for specific columns
columns_to_fill = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea']
# Apply mean to the specified columns for testing data
test_df[columns_to_fill] = test_df[columns_to_fill].fillna(test_df[columns_to_fill].mean())
The final check:
test_df.isnull().sum().sum()
train_df.columns
Transformation
The code calculates the age of the house and garage by subtracting the year they were built from the current year (2024). This transformation makes the features more relevant for modeling, as the age of the property is typically more useful than the year it was built.
for df in [train_df, test_df]:
    df['YearBuilt'] = 2024 - df['YearBuilt']
    df['GarageYrBlt'] = 2024 - df['GarageYrBlt']
The goal of this code is to visually inspect the relationships between various numerical features and the target variable (SalePrice), while also displaying the correlation and p-value for each feature. This helps to understand the strength and significance of these relationships for feature selection.
# Number of rows and columns in the subplot grid
nr_rows = 15
nr_cols = 2
# Create the figure and axes for the subplots
fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols * 6, nr_rows * 6))
# List of numerical features, excluding the target and 'Id' columns
li_numerical = list(numerical)
li_not = ['Id', 'SalePrice']
#li_plot_numerical = [c for c in li_numerical if c not in li_not]
li_plot_numerical = [col for col in li_numerical if train_df[col].nunique() > 1]
# Create a color palette
colors = sns.color_palette("viridis", n_colors=len(li_plot_numerical))
# Loop through each subplot and create the regression plots
for r in range(nr_rows):
    for c in range(nr_cols):
        i = r * nr_cols + c
        if i < len(li_plot_numerical):
            # Create the regression plot for each feature vs the target 'SalePrice'
            sns.regplot(
                x=train_df[li_plot_numerical[i]],
                y=train_df['SalePrice'],
                ax=axs[r][c],
                scatter_kws={"s": 40, "color": colors[i]},  # Set the scatter point color
                line_kws={"color": colors[i]},              # Set the regression line color
            )
            # Calculate the Pearson correlation coefficient and p-value
            stp = stats.pearsonr(train_df[li_plot_numerical[i]], train_df['SalePrice'])
            # Format the title with the correlation and p-value
            str_title = f"r = {stp[0]:.2f} p = {stp[1]:.2f}"
            axs[r][c].set_title(str_title, fontsize=8)
# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
To select features, we keep those whose correlation with SalePrice is above the threshold min_correlation = 0.2.
The threshold can be adjusted later based on model performance and the number of features retained.
# Ensure you have only numeric columns for correlation calculation
numeric_df = train_df.select_dtypes(include=['number'])
# Calculate the correlation matrix
corr = numeric_df.corr()
# Get the absolute values of the correlation matrix
corr_abs = corr.abs()
# Number of numerical columns (excluding the target)
nr_num_cols = len(numerical)
# Get the correlation of all numerical features with 'SalePrice'
ser_corr = corr_abs['SalePrice'].nlargest(nr_num_cols)
# Select features whose correlation is above the min_val_corr threshold
cols_abv_corr_limit = list(ser_corr[ser_corr.values > min_correlation].index)
# Select features whose correlation is below the min_val_corr threshold
cols_below_corr_limit = list(ser_corr[ser_corr.values <= min_correlation].index)
# Print the list of features with correlation above the threshold
print("List of numerical features above min correlation:\n")
print(cols_abv_corr_limit)
This visualization helps understand the distribution of SalePrice for each categorical feature, highlighting differences between categories and any potential outliers.
# List of categorical features
cat_features = list(categorical)
# Set the grid size for the subplots
rows = 22
cols = 2
# Create a figure with the defined grid size
fig, axes = plt.subplots(rows, cols, figsize=(cols * 6, rows * 6))
# Iterate through each subplot to create a boxplot
for row in range(rows):
    for col in range(cols):
        index = row * cols + col
        if index < len(cat_features):
            sns.boxplot(x=cat_features[index], y='SalePrice', data=train_df, ax=axes[row][col])
            # Set font size for axis labels, ticks, and titles
            axes[row][col].tick_params(labelsize=10)  # smaller tick labels
            axes[row][col].set_xlabel(axes[row][col].get_xlabel(), fontsize=10)  # smaller x-axis labels
            axes[row][col].set_ylabel(axes[row][col].get_ylabel(), fontsize=10)  # smaller y-axis labels
            axes[row][col].set_title(axes[row][col].get_title(), fontsize=12)    # smaller title
# Adjust the layout to prevent overlapping
plt.tight_layout()
plt.show()
For certain features it is straightforward to identify a strong relationship with SalePrice: 'Neighborhood', 'Electrical', 'ExterQual', 'MasVnrType', 'BsmtQual', 'MSZoning', 'CentralAir', 'Condition2', 'KitchenQual', 'SaleType'. The 'Street' feature, for example, shows only a weak relationship.
catgerical_strong_correlation = ['Neighborhood', 'Electrical', 'ExterQual', 'MasVnrType', 'BsmtQual',
'MSZoning', 'CentralAir', 'Condition2', 'KitchenQual', 'SaleType']
catgerical_weak_correlation = ['SaleCondition', 'MiscFeature', 'Fence', 'PoolQC', 'PavedDrive',
'GarageCond', 'GarageQual', 'GarageFinish', 'GarageType', 'FireplaceQu',
'Functional', 'HeatingQC', 'Heating', 'BsmtFinType2', 'BsmtFinType1',
'BsmtExposure', 'BsmtCond', 'Foundation', 'ExterCond', 'Exterior2nd',
'Exterior1st', 'RoofMatl', 'RoofStyle', 'HouseStyle', 'BldgType',
'Condition1', 'LandSlope', 'LotConfig', 'Utilities', 'LandContour',
'LotShape', 'Alley', 'Street']
The heatmap visually represents the correlation between numerical features, with darker red indicating a stronger positive correlation. This helps identify features that are highly correlated with each other, which is useful for feature selection and for spotting multicollinearity.
# Show absolute correlation between numerical features in a heatmap
plt.figure(figsize=(17,17))
cor = np.abs(train_df[cols_abv_corr_limit].corr()) # Use only numerical features
# Create the heatmap with smaller annotations
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds, vmin=0, vmax=1, annot_kws={"size": 8}) # Change the font size of annotations
# Display the plot
plt.show()
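As a complement to the heatmap, a small helper (a sketch added here for illustration, not part of the original notebook) can list the feature pairs whose absolute correlation exceeds a threshold:

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# Toy frame: b is an exact linear function of a, so the pair (a, b) is flagged
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 1, 4, 2]})
print(high_corr_pairs(toy, threshold=0.9))  # [('a', 'b', 1.0)]
```

Pairs flagged this way are candidates for dropping one of the two columns before fitting a linear model.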
A correlation heatmap helps identify relationships between features and the target variable. We drop columns based on the numerical correlation threshold and on the weak-correlation list for categorical variables:
# Extract the 'Id' column from df_test
id_test = test_df['Id']
# Define the columns to drop based on correlation limits and weak correlation for categorical variables
to_drop_num = cols_below_corr_limit
to_drop_catg = catgerical_weak_correlation
# Create a list of columns to drop, including 'Id'
cols_to_drop = ['Id'] + to_drop_num + to_drop_catg
# Drop the specified columns from both train and test dataframes
for df in [train_df, test_df]:
    df.drop(cols_to_drop, inplace=True, axis=1)
Label Encoding: transform the categorical data into ordinal data by translating each category to an integer. This should be done when there is an order to the values, or when there are too many values to handle otherwise. To understand this better we use sns.violinplot: it shows the distribution of quantitative data across the levels of one (or more) categorical variables so that those distributions can be compared.
One-Hot Encoding: transform the categorical data into a few binary columns, translating each category into a column of 0/1 values (1 if the original categorical value is present in the row, 0 if not). This should be done when there is no order to the values and when the column does not have too many distinct values. With an unregularized regression, one dummy column is typically dropped (drop='first') to avoid perfect collinearity among the dummies.
# Set up the figure and axes
plt.figure(figsize=(16, 5))
# Create a violin plot for 'Neighborhood' against the target
sns.violinplot(x='Neighborhood', y='SalePrice', data=train_df)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Add a title for the plot
plt.title('Violin Plot of Neighborhood vs. SalePrice', fontsize=14)
# Display the plot
plt.tight_layout()
plt.show()
catg_list = catgerical_strong_correlation.copy()
#only for the printing
catg_list.remove('Neighborhood')
for catg in catg_list:
    sns.violinplot(x=catg, y='SalePrice', data=train_df)
    plt.show()
# Identify non-numeric columns
non_numeric_columns = train_df.select_dtypes(exclude=['number']).columns
# Iterate through non-numeric columns and print unique values
for col in non_numeric_columns:
    print(f"Column: {col}")
    print(train_df[col].unique())
    print("-" * 50)
These columns represent the categorical features remaining in the dataset, each containing a distinct set of categories or values.
Label Encoding:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Define ordinal categories and their order
ordinal_categories = {
'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
'BsmtQual': ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
'CentralAir': ['N', 'Y']
}
# Define nominal columns
nominal_categories = ['MSZoning', 'Neighborhood', 'Condition2', 'MasVnrType', 'Electrical', 'SaleType']
# Step 1: Ordinal Encoding
# Extract only the ordinal columns
ordinal_encoder = OrdinalEncoder(categories=[ordinal_categories[col] for col in ordinal_categories])
# Fit and transform the ordinal columns
ordinal_cols = list(ordinal_categories.keys())
train_df[ordinal_cols] = ordinal_encoder.fit_transform(train_df[ordinal_cols])
test_df[ordinal_cols] = ordinal_encoder.transform(test_df[ordinal_cols])
One-Hot Encoding:
# Step 2: One-Hot Encoding
# Initialize OneHotEncoder
one_hot_encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
# Fit the encoder on the nominal columns
encoded_nominal_train = one_hot_encoder.fit_transform(train_df[nominal_categories])
encoded_nominal_test = one_hot_encoder.transform(test_df[nominal_categories])
# Convert the one-hot encoded data into DataFrames
encoded_train_nominal_df = pd.DataFrame(encoded_nominal_train, columns=one_hot_encoder.get_feature_names_out(nominal_categories))
encoded_test_nominal_df = pd.DataFrame(encoded_nominal_test, columns=one_hot_encoder.get_feature_names_out(nominal_categories))
# Reset index to match train and test DataFrame indices
encoded_train_nominal_df.index = train_df.index
encoded_test_nominal_df.index = test_df.index
# Drop the original nominal columns and concatenate the encoded DataFrames
train_df = pd.concat([train_df.drop(columns=nominal_categories), encoded_train_nominal_df], axis=1)
test_df = pd.concat([test_df.drop(columns=nominal_categories), encoded_test_nominal_df], axis=1)
train_df
This code defines a function, remove_outliers, that removes rows from a DataFrame where the target feature (SalePrice) contains outliers based on the Interquartile Range (IQR) method.
Compute IQR: take the first quartile (Q1) and third quartile (Q3) of SalePrice and compute IQR = Q3 - Q1.
Set outlier bounds: lower bound = Q1 - 1.5 * IQR, upper bound = Q3 + 1.5 * IQR.
Filter rows: keep only the rows where SalePrice is within the bounds (lower_bound ≤ SalePrice ≤ upper_bound).
Finally, it removes outliers from train_df based on the SalePrice feature.
def remove_outliers(df, target_feature):
    # Compute the IQR for the numerical feature
    Q1 = target_feature.quantile(0.25)
    Q3 = target_feature.quantile(0.75)
    IQR = Q3 - Q1
    # Compute the bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter out the values that fall outside the bounds
    mask = (target_feature >= lower_bound) & (target_feature <= upper_bound)
    return df[mask]
train_df = remove_outliers(train_df, train_df["SalePrice"])
train_df
This script performs Principal Component Analysis (PCA) and hyperparameter tuning to find the best settings for K-Nearest Neighbors (KNN) Regression and Decision Tree Regression using GridSearchCV.
The features (X) and the target variable (t) are defined, and the data is split with train_test_split (80% training, 20% validation).
KNN: a Pipeline of PCA() for dimensionality reduction and KNeighborsRegressor(), tuning the number of PCA components (pca__n_components) and the number of neighbors (knn__n_neighbors).
Decision Tree: a Pipeline of PCA() and DecisionTreeRegressor(), tuning the number of PCA components (pca__n_components) and the tree depth (dt__max_depth).
Finally, pca_knn is fit with the best PCA components for KNN and pca_dt with the best components for the Decision Tree, so the transformed feature sets (X_knn and X_dt) are ready for model training.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
target_column = 'SalePrice' # target column
# Define the feature columns (everything except the target column)
X = train_df.drop(columns=[target_column]) # Features
t = train_df[target_column] # Target
X = StandardScaler().fit_transform(X)
X_train, X_val, t_train, t_val = train_test_split(X, t, test_size=0.2, random_state=42)
### 1. Find Best PCA + KNN Parameters ###
knn_pipe = Pipeline([
('pca', PCA()),
('knn', KNeighborsRegressor())
])
knn_param_grid = {
'pca__n_components': range(1, 20), # PCA components
'knn__n_neighbors': range(1, 30) # KNN neighbors
}
knn_grid_search = GridSearchCV(knn_pipe, knn_param_grid, cv=5, scoring='neg_root_mean_squared_error')
knn_grid_search.fit(X, t)
# X_train, t_train
# Get Best PCA settings for KNN
best_pca_knn = knn_grid_search.best_params_['pca__n_components']
best_k = knn_grid_search.best_params_['knn__n_neighbors']
print("Best KNN Parameters:", best_k)
print("Best KNN Score (RMSE):", -knn_grid_search.best_score_)
### 2. Find Best PCA + Decision Tree Parameters ###
dt_pipe = Pipeline([
('pca', PCA()),
('dt', DecisionTreeRegressor(random_state=42))
])
dt_param_grid = {
'pca__n_components': range(1, 20), # PCA components
'dt__max_depth': range(1, 20) # Max depth of the tree
}
dt_grid_search = GridSearchCV(dt_pipe, dt_param_grid, cv=5, scoring='neg_root_mean_squared_error')
dt_grid_search.fit(X, t)
# Get Best PCA settings for Decision Tree
best_pca_dt = dt_grid_search.best_params_['pca__n_components']
best_depth = dt_grid_search.best_params_['dt__max_depth']
print("Best Decision Tree depth:", best_depth)
print("Best Decision Tree Score (RMSE):", -dt_grid_search.best_score_)
# 3. Transform Data Using Best PCA for Each Model
pca_knn = PCA(n_components=best_pca_knn)
pca_dt = PCA(n_components=best_pca_dt)
X_knn = pca_knn.fit_transform(X)
X_dt = pca_dt.fit_transform(X)
This script evaluates Bagging and Boosting applied to K-Nearest Neighbors (KNN) and Decision Tree regressors. The models are assessed through bootstrap resampling, with 50 iterations to compute Root Mean Squared Error (RMSE) and R² scores for both training and validation sets.
Model setup: the KNN and Decision Tree base models use the best hyperparameters found earlier (best_k and best_depth).
Bootstrap resampling: each iteration draws a training sample with replacement and a validation sample without replacement.
Performance metrics: RMSE and R² are computed for both samples.
Results: the averages over all iterations are reported and the losses are plotted.
This approach helps assess the stability and generalization capabilities of the models under different resampling scenarios.
from sklearn.utils import resample
from sklearn.metrics import r2_score
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import RandomForestRegressor
# models
knn = KNeighborsRegressor(n_neighbors=best_k)
dt = DecisionTreeRegressor(max_depth=best_depth)
# bagging
knn_bagging = BaggingRegressor(estimator=knn, n_estimators=50, random_state=42, bootstrap=True)
dt_bagging = BaggingRegressor(estimator=dt, n_estimators=50, random_state=42, bootstrap=True)
# boosting
knn_boosting = AdaBoostRegressor(estimator=knn, n_estimators=50, random_state=42)
dt_boosting = AdaBoostRegressor(estimator=dt, n_estimators=50, random_state=42)
rf_model = RandomForestRegressor(criterion='squared_error', n_estimators=100, random_state=42)
estimators = {
"KNN_Bagging" : knn_bagging,
"dt_bagging" : dt_bagging,
"KNN_Boosting": knn_boosting,
"dt_boosting" : dt_boosting,
"Random_Forest" : rf_model
}
n_iterations = 50
for model_name, model in estimators.items():
    print(f"\n{'-' * 20}\nModel: {model_name}\n{'-' * 20}")
    train_losses = []
    val_losses = []
    train_r2_scores = []
    val_r2_scores = []
    if "KNN" in model_name:
        X_data = X_knn
    else:  # Decision Tree models
        X_data = X_dt
    for iteration in range(n_iterations):
        X_train_sample, t_train_sample = resample(X_data, t, replace=True)
        X_val_sample, t_val_sample = resample(X_data, t, replace=False)
        # Train the model
        model.fit(X_train_sample, t_train_sample)
        # Compute RMSE for train and validation
        train_loss = np.sqrt(mean_squared_error(t_train_sample, model.predict(X_train_sample)))
        val_loss = np.sqrt(mean_squared_error(t_val_sample, model.predict(X_val_sample)))
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        # Compute R² for train and validation
        train_r2 = model.score(X_train_sample, t_train_sample)  # R² on training data
        val_r2 = model.score(X_val_sample, t_val_sample)        # R² on validation data
        train_r2_scores.append(train_r2)
        val_r2_scores.append(val_r2)
    # Compute averages
    mean_train_loss = np.mean(train_losses)
    mean_val_loss = np.mean(val_losses)
    mean_train_r2 = np.mean(train_r2_scores)
    mean_val_r2 = np.mean(val_r2_scores)
    print(f"\nResults for {model_name}:")
    print(f"Average Training Loss (RMSE): {mean_train_loss:.4f}")
    print(f"Average Validation Loss (RMSE): {mean_val_loss:.4f}")
    print(f"Average Training R²: {mean_train_r2:.4f}")
    print(f"Average Validation R²: {mean_val_r2:.4f}")
    # Plot training vs validation loss
    plt.figure(figsize=(10, 5))
    plt.plot(train_losses, label="Training Loss")
    plt.plot(val_losses, label="Validation Loss")
    plt.xlabel('Bootstrap Iterations')
    plt.ylabel('Loss')
    plt.legend()
    plt.title(f'{model_name} - Training vs Validation Loss')
    plt.show()
We observe large differences in performance between the training and validation data for all models:
Bagging KNN:
Bagging Decision Tree:
Boosting KNN:
Boosting Decision Tree:
Random Forest:
This code trains a Random Forest Regressor on PCA-transformed data, using the squared_error criterion. It evaluates the model's performance on the validation set by calculating RMSE and R² scores.
Locally Weighted Linear Regression (LWLR) is not suitable for a house price prediction competition for several reasons:
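One reason is computational. As a minimal sketch on synthetic data (hypothetical helper name, assuming a Gaussian kernel), LWLR has no single global fit: it must solve a separate weighted least-squares problem for every query point, which is expensive for a large test set:

```python
import numpy as np

def lwlr_predict(x_query, X, t, tau=1.0):
    """Locally weighted linear regression: solve a fresh weighted
    least-squares problem for each query point (costly at scale)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # design matrix with bias column
    xq = np.hstack([1.0, np.atleast_1d(x_query)])
    # Gaussian kernel weights: nearby training points count more
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Normal equations for the weighted least-squares fit around x_query
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ t
    return xq @ theta

# Synthetic 1-D data with an exact linear relation t = 3x
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
t = 3 * X[:, 0]
print(round(lwlr_predict(np.array([5.0]), X, t, tau=2.0), 2))  # ≈ 15.0
```

Each prediction touches the entire training set, so predicting n test rows costs n full matrix solves, unlike a global linear regression that is fit once.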
LinearRegression(): instantiates a simple linear regression model.
fit(X_train, t_train): trains the model using the training data and target values.
NE_reg.score: computes the R² score.
# Step 2: Train the model
model = LinearRegression()
NE_reg = model.fit(X_train, t_train)
# calculate R2 score for each group
print('R2 score on train', NE_reg.score(X_train, t_train))
print('R2 score on validation', NE_reg.score(X_val, t_val))
# calculate MSE and RMSE
y_train = NE_reg.predict(X_train)
y_val = model.predict(X_val)
print('MSE on train', metrics.mean_squared_error(t_train, y_train))
print('MSE on validation', metrics.mean_squared_error(t_val, y_val))
print()
# np.sqrt is used instead of squared=False, which newer sklearn versions removed
print('RMSE on train', np.sqrt(metrics.mean_squared_error(t_train, y_train)))
print('RMSE on validation', np.sqrt(metrics.mean_squared_error(t_val, y_val)))
############################# OUTPUT FOR HOMEWORK 1, LINEAR REGRESSION:
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LinearRegression
# import pandas as pd
# # Example pipeline (Ensure pipeline is defined and trained)
# pipeline_model = Pipeline([
# ('scaler', StandardScaler()),
# ('regressor', LinearRegression())
# ])
# # Train the pipeline
# pipeline_model.fit(X_train, y_train)
# # Predict on test data
# # X_test = test_encoded[high_corr_features] # Ensure these variables are defined
# predictions = pipeline_model.predict(test_df)
# # Create the submission file
# output = pd.DataFrame({'Id': id_test, # Ensure test_ID is defined
# 'SalePrice': predictions})
# output.to_csv('submission.csv', index=False)
# print("Saved the predictions to a .csv file")
################################ OUTPUT FOR HOMEWORK 4, DECISION TREE:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
# Fit the scaler on the training features and reuse the PCA already fitted on the
# training data (refitting either transform on the test set would shift the features)
scaler = StandardScaler().fit(train_df.drop(columns=['SalePrice']))
test_df_scaled = scaler.transform(test_df)
test_df_pca = pca_dt.transform(test_df_scaled)
predictions = dt_boosting.predict(test_df_pca)
# Create the submission file
output = pd.DataFrame({'Id': id_test, # Ensure id_test is defined
'SalePrice': predictions})
output.to_csv('submission.csv', index=False)
print("Saved the predictions to a .csv file")
The code implements a function to evaluate Linear Regression performance across various splits of the data, plotting the results for Mean Squared Error (MSE) and R² score for both training and validation sets.
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import linear_model
from sklearn import model_selection
import matplotlib.pyplot as plt
# Function that plots the MSE and R2 graphs with matplotlib
def print_graphs_r2_mse(graph_points):
    for k, v in graph_points.items():
        # Print the values to verify that they change
        print(f'{k}: {v}')
        # Find the best value of each graph (max for R2, min for MSE)
        best_value = max(v.values()) if 'R2' in k else min(v.values())
        best_index = np.argmax(list(v.values())) if 'R2' in k else np.argmin(list(v.values()))
        color = 'red' if 'train' in k else 'blue'
        # Create the graph with matplotlib
        plt.plot(list(v.keys()), list(v.values()), color=color, label=k)
        # Add a title
        plt.title(f'{k}, best value: x={best_index + 1}, y={best_value}')
        plt.xlabel('Epochs (Test size)')
        plt.ylabel('Score (MSE/R2)')
        plt.legend()
        plt.show()
# Function that computes the error for every split size
def plot_score_and_loss_by_split(X, t):
    graph_points = {
        'train_MSE': {},
        'val_MSE': {},
        'train_R2': {},
        'val_R2': {}
    }
    for size in range(10, 100, 10):  # loop over the split sizes
        X_train, X_val, t_train, t_val = model_selection.train_test_split(
            X.values, t.values, test_size=size/100, random_state=42)
        # Train the model
        NE_reg = linear_model.LinearRegression().fit(X_train, t_train)
        # Compute the predictions
        y_train = NE_reg.predict(X_train)
        y_val = NE_reg.predict(X_val)
        # Compute MSE and R2
        graph_points['train_MSE'][size/100] = metrics.mean_squared_error(t_train, y_train)
        graph_points['val_MSE'][size/100] = metrics.mean_squared_error(t_val, y_val)
        graph_points['train_R2'][size/100] = NE_reg.score(X_train, t_train)
        graph_points['val_R2'][size/100] = NE_reg.score(X_val, t_val)
    # Plot the graphs
    print_graphs_r2_mse(graph_points)
# Assumes X and t are the data already in use
plot_score_and_loss_by_split(train_df.drop(columns=['SalePrice']), train_df['SalePrice'])
The output represents the training and validation performance metrics for different test sizes (from 10% to 90% of the dataset).
MSE (Mean Squared Error):
R² (Coefficient of Determination):
This indicates the model works best with a moderate split (e.g., a 10-30% test size) for this dataset.
In this project, we developed a predictive model for house prices using a combination of machine learning techniques. Initially, we built a linear regression model and later expanded the approach by incorporating K-Nearest Neighbors (KNN), Decision Tree (both with bagging and boosting), and Random Forest models to improve predictive accuracy. The workflow involved thorough data exploration, preprocessing, feature engineering, and model evaluation to optimize performance.
Data Exploration:
Handling Missing Data:
Outlier Detection and Removal:
Feature Engineering:
Selected features by their correlation with the target (SalePrice).
Graphical Analysis:
Encoding Categorical Variables:
Model Development & Evaluation:
Features such as OverallQual, GrLivArea, and GarageCars had high correlations with house prices, and certain neighborhoods (e.g., StoneBr, NridgHt) exhibited significantly higher average prices.
By incorporating multiple machine learning models, this project demonstrated the impact of ensemble learning on predictive accuracy. Random Forest, KNN with bagging and boosting, and boosted Decision Trees provided the best performance, highlighting the benefits of leveraging diverse algorithms. The use of GridSearchCV for optimizing PCA and hyperparameters significantly improved model efficiency. Future work could focus on further fine-tuning and exploring additional feature selection techniques to enhance the model.